Introduction

Welcome to a data analysis case study on Cyclistic’s bike sharing. In this case study, I will work as a junior data analyst for a fictitious organization called Cyclistic. The company’s stakeholder believes that the company’s future success depends on the annual membership subscription rate. As a junior data analyst working in the company’s marketing department, my team wants to understand how annual and casual members use Cyclistic bikes differently and provide. From these insights, my team is required to develop new marketing strategies to increase the sales of annual memberships.

About the company

Cyclistic introduced a popular bike-share program in 2016. The initiative has expanded since then to include a fleet of 5,824 bicycles that are geo-tracked and locked into a system of 692 stations throughout Chicago. The bikes may be released from one station and brought back to any other station in the network. Up to this point, Cyclistic’s marketing approach focused on raising public awareness and appealing to a wide range of consumer groups. The price plans’ flexibility, which included single-ride passes, full-day passes, and annual memberships, was one strategy that assisted in making these things possible. Casual riders are those who buy one-ride or all-day passes from the company. Cyclistic members are customers who purchase annual memberships. Although the pricing flexibility helps Cyclistic attract more customers, especially casual riders, Cyclistic’s finance analysts have concluded that annual members are much more profitable. The company believes that, rather than creating a marketing campaign to attract new casual riders, it should focus on converting casual riders into members as they are already aware of the annual membership program and have chosen Cyclistic for their mobility needs. The marketing analyst team needs to have a deeper understanding of the distinctions between annual members and casual riders, the motivations behind why casual riders would purchase a membership, and the potential impact of digital media on their marketing strategies.

Phase 1:- Ask

In phase 1, our team was asked to address three questions that will guide the future marketing program: How do annual members and casual riders use Cyclistic bikes differently? Why would casual riders buy Cyclistic annual memberships? How can Cyclistic use digital media to influence casual riders to become members? I was assigned the first question: How do annual members and casual riders use Cyclistic bikes differently? To understand the behavior of the riders, I decided to analyze how differently annual members and casual riders use the Cyclistic bikes during a year, which type of bike they prefer most, and the most popular stations.

Business Task

Cyclistic’s finance analysts have concluded that annual members are much more profitable than casual riders. So the marketing analysts team is requested to understand the behavior of Cyclistic’s customers and provide solutions with recommendations with supporting data insights and professional data visualizations.

Phase 2:- Prepare

For this case study, the dataset is provided by the Cyclistic itself as CSV files. The dataset is organized according to months, and we will use one year of data from July 2021 to June 2022. As the data is sourced from the first party, the data’s credibility is excellent, licensed, and cited.

The dataset was downloaded and stored locally on a hard drive and saved as CSV files. The dataset has a total of 13 columns and 5,900,385 rows.

Source of the dataset:- https://divvy-tripdata.s3.amazonaws.com/index.html

Following are the columns:-

  1. “ride_id” - Unique identification number for each rider.
  2. “rideable_type” - Type of bike
  3. “started_at” - Starting date and time of the ride
  4. “ended_at” - Ending date and time of the ride
  5. “start_station_name” - Starting station name for the ride
  6. “start_station_id” - Unique identification number of Starting station
  7. “end_station_name”- Ending station name for the ride
  8. “end_station_id” - Unique identification number of Ending station
  9. “start_lat” - Latitude of the starting station
  10. “start_lng” - Longitude of the starting station
  11. “end_lat” - Latitude of the ending station
  12. “end_lng” - Longitude of the ending station
  13. “member_casual” - Type of customer

Phase 3:- Process

We will use R studio for Cleaning, analyzing, and visualizing this case study. MSExcel could not handle the dataset as it is large.

Initialising the Environment

For this case study, the following packages will be used.

#installing required libraries
# install.packages("tidyverse")
# install.packages("lubridate")
# install.packages("ggplot2")
# install.packages("dplyr")
# install.packages("geosphere")
# install.packages("skimr")
# install.packages("mapview")


#lording libraries
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6     ✔ purrr   0.3.4
## ✔ tibble  3.1.7     ✔ dplyr   1.0.9
## ✔ tidyr   1.2.0     ✔ stringr 1.4.0
## ✔ readr   2.1.2     ✔ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
library(lubridate)
## 
## Attaching package: 'lubridate'
## 
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union
library(ggplot2)
library(dplyr)
library(geosphere)
library(skimr)
library(mapview)

Load dataset for one year from July 2021 to June 2022.

#lording data into bike_share
bike_share<- list.files(path="C:/Users/akhil/OneDrive/Desktop/Case Study/data/trip_data", full.names = TRUE) %>% 
  lapply(read_csv) %>% 
  bind_rows 
## Rows: 822410 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## Rows: 804352 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## Rows: 756147 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## Rows: 631226 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## Rows: 359978 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## Rows: 247540 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## Rows: 103770 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## Rows: 115609 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## Rows: 284042 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## Rows: 371249 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## Rows: 634858 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## Rows: 769204 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): ride_id, rideable_type, start_station_name, start_station_id, end_...
## dbl  (4): start_lat, start_lng, end_lat, end_lng
## dttm (2): started_at, ended_at
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Description of dataset

head(bike_share)
## # A tibble: 6 × 13
##   ride_id rideable_type started_at          ended_at            start_station_n…
##   <chr>   <chr>         <dttm>              <dttm>              <chr>           
## 1 0A1B62… docked_bike   2021-07-02 14:44:36 2021-07-02 15:19:58 Michigan Ave & …
## 2 B2D558… classic_bike  2021-07-07 16:57:42 2021-07-07 17:16:09 California Ave …
## 3 6F2645… classic_bike  2021-07-25 11:30:55 2021-07-25 11:48:45 Wabash Ave & 16…
## 4 379B58… classic_bike  2021-07-08 22:08:30 2021-07-08 22:23:32 California Ave …
## 5 6615C1… electric_bike 2021-07-28 16:08:06 2021-07-28 16:27:09 California Ave …
## 6 62DC2B… electric_bike 2021-07-29 17:09:08 2021-07-29 17:15:00 California Ave …
## # … with 8 more variables: start_station_id <chr>, end_station_name <chr>,
## #   end_station_id <chr>, start_lat <dbl>, start_lng <dbl>, end_lat <dbl>,
## #   end_lng <dbl>, member_casual <chr>
tail(bike_share)
## # A tibble: 6 × 13
##   ride_id rideable_type started_at          ended_at            start_station_n…
##   <chr>   <chr>         <dttm>              <dttm>              <chr>           
## 1 F1C4F6… electric_bike 2022-06-26 19:50:04 2022-06-26 19:55:23 Clarendon Ave &…
## 2 7B3B28… classic_bike  2022-06-25 00:56:48 2022-06-25 01:01:39 Sheffield Ave &…
## 3 1E9939… classic_bike  2022-06-25 00:56:25 2022-06-25 01:00:26 Sheffield Ave &…
## 4 AEA166… electric_bike 2022-06-12 12:47:12 2022-06-12 12:47:36 Milwaukee Ave &…
## 5 B9F527… classic_bike  2022-06-12 13:28:46 2022-06-12 13:53:11 Clark St & Rand…
## 6 D24131… classic_bike  2022-06-12 14:40:51 2022-06-12 15:08:14 Blue Island Ave…
## # … with 8 more variables: start_station_id <chr>, end_station_name <chr>,
## #   end_station_id <chr>, start_lat <dbl>, start_lng <dbl>, end_lat <dbl>,
## #   end_lng <dbl>, member_casual <chr>
colnames(bike_share)
##  [1] "ride_id"            "rideable_type"      "started_at"        
##  [4] "ended_at"           "start_station_name" "start_station_id"  
##  [7] "end_station_name"   "end_station_id"     "start_lat"         
## [10] "start_lng"          "end_lat"            "end_lng"           
## [13] "member_casual"
dim(bike_share)
## [1] 5900385      13
str(bike_share)
## spec_tbl_df [5,900,385 × 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ ride_id           : chr [1:5900385] "0A1B623926EF4E16" "B2D5583A5A5E76EE" "6F264597DDBF427A" "379B58EAB20E8AA5" ...
##  $ rideable_type     : chr [1:5900385] "docked_bike" "classic_bike" "classic_bike" "classic_bike" ...
##  $ started_at        : POSIXct[1:5900385], format: "2021-07-02 14:44:36" "2021-07-07 16:57:42" ...
##  $ ended_at          : POSIXct[1:5900385], format: "2021-07-02 15:19:58" "2021-07-07 17:16:09" ...
##  $ start_station_name: chr [1:5900385] "Michigan Ave & Washington St" "California Ave & Cortez St" "Wabash Ave & 16th St" "California Ave & Cortez St" ...
##  $ start_station_id  : chr [1:5900385] "13001" "17660" "SL-012" "17660" ...
##  $ end_station_name  : chr [1:5900385] "Halsted St & North Branch St" "Wood St & Hubbard St" "Rush St & Hubbard St" "Carpenter St & Huron St" ...
##  $ end_station_id    : chr [1:5900385] "KA1504000117" "13432" "KA1503000044" "13196" ...
##  $ start_lat         : num [1:5900385] 41.9 41.9 41.9 41.9 41.9 ...
##  $ start_lng         : num [1:5900385] -87.6 -87.7 -87.6 -87.7 -87.7 ...
##  $ end_lat           : num [1:5900385] 41.9 41.9 41.9 41.9 41.9 ...
##  $ end_lng           : num [1:5900385] -87.6 -87.7 -87.6 -87.7 -87.7 ...
##  $ member_casual     : chr [1:5900385] "casual" "casual" "member" "member" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   ride_id = col_character(),
##   ..   rideable_type = col_character(),
##   ..   started_at = col_datetime(format = ""),
##   ..   ended_at = col_datetime(format = ""),
##   ..   start_station_name = col_character(),
##   ..   start_station_id = col_character(),
##   ..   end_station_name = col_character(),
##   ..   end_station_id = col_character(),
##   ..   start_lat = col_double(),
##   ..   start_lng = col_double(),
##   ..   end_lat = col_double(),
##   ..   end_lng = col_double(),
##   ..   member_casual = col_character()
##   .. )
##  - attr(*, "problems")=<externalptr>
summary(bike_share)
##    ride_id          rideable_type        started_at                    
##  Length:5900385     Length:5900385     Min.   :2021-07-01 00:00:22.00  
##  Class :character   Class :character   1st Qu.:2021-08-26 07:57:58.00  
##  Mode  :character   Mode  :character   Median :2021-10-27 17:35:55.00  
##                                        Mean   :2021-12-12 00:11:36.51  
##                                        3rd Qu.:2022-04-25 13:41:23.00  
##                                        Max.   :2022-06-30 23:59:58.00  
##                                                                        
##     ended_at                      start_station_name start_station_id  
##  Min.   :2021-07-01 00:04:51.00   Length:5900385     Length:5900385    
##  1st Qu.:2021-08-26 08:11:00.00   Class :character   Class :character  
##  Median :2021-10-27 17:49:46.00   Mode  :character   Mode  :character  
##  Mean   :2021-12-12 00:31:53.47                                        
##  3rd Qu.:2022-04-25 13:57:17.00                                        
##  Max.   :2022-07-13 04:21:06.00                                        
##                                                                        
##  end_station_name   end_station_id       start_lat       start_lng     
##  Length:5900385     Length:5900385     Min.   :41.64   Min.   :-87.84  
##  Class :character   Class :character   1st Qu.:41.88   1st Qu.:-87.66  
##  Mode  :character   Mode  :character   Median :41.90   Median :-87.64  
##                                        Mean   :41.90   Mean   :-87.65  
##                                        3rd Qu.:41.93   3rd Qu.:-87.63  
##                                        Max.   :45.64   Max.   :-73.80  
##                                                                        
##     end_lat         end_lng       member_casual     
##  Min.   :41.39   Min.   :-88.97   Length:5900385    
##  1st Qu.:41.88   1st Qu.:-87.66   Class :character  
##  Median :41.90   Median :-87.64   Mode  :character  
##  Mean   :41.90   Mean   :-87.65                     
##  3rd Qu.:41.93   3rd Qu.:-87.63                     
##  Max.   :42.17   Max.   :-87.49                     
##  NA's   :5374    NA's   :5374
skim_without_charts(bike_share)
Data summary
Name bike_share
Number of rows 5900385
Number of columns 13
_______________________
Column type frequency:
character 7
numeric 4
POSIXct 2
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
ride_id 0 1.00 16 16 0 5900385 0
rideable_type 0 1.00 11 13 0 3 0
start_station_name 836018 0.86 3 64 0 1293 0
start_station_id 836015 0.86 3 44 0 1157 0
end_station_name 892103 0.85 9 64 0 1315 0
end_station_id 892103 0.85 3 44 0 1171 0
member_casual 0 1.00 6 6 0 2 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100
start_lat 0 1 41.90 0.05 41.64 41.88 41.90 41.93 45.64
start_lng 0 1 -87.65 0.03 -87.84 -87.66 -87.64 -87.63 -73.80
end_lat 5374 1 41.90 0.05 41.39 41.88 41.90 41.93 42.17
end_lng 5374 1 -87.65 0.03 -88.97 -87.66 -87.64 -87.63 -87.49

Variable type: POSIXct

skim_variable n_missing complete_rate min max median n_unique
started_at 0 1 2021-07-01 00:00:22 2022-06-30 23:59:58 2021-10-27 17:35:55 4924385
ended_at 0 1 2021-07-01 00:04:51 2022-07-13 04:21:06 2021-10-27 17:49:46 4924865

Data cleaning.

In this stage, we will clean the data, create a few new variables for further calculation, analyze, and convert the data type into a few columns. As some of the data in the data set were in uppercase, it is better to convert all the data into lowercase to achieve consistency and avoid redundancy.

#converting data into lower case
bike_share$ride_id <- tolower(bike_share$ride_id )
bike_share$rideable_type <- tolower(bike_share$rideable_type )
bike_share$start_station_name <- tolower(bike_share$start_station_name )
bike_share$start_station_id <- tolower(bike_share$start_station_id )
bike_share$end_station_name <- tolower(bike_share$end_station_name )
bike_share$end_station_id  <- tolower(bike_share$end_station_id  )
bike_share$member_casual <- tolower(bike_share$member_casual)

Cyclistic offer three types of bike for the customer.

unique(bike_share$rideable_type)
## [1] "docked_bike"   "classic_bike"  "electric_bike"

Cyclistic has two types of consumers.

unique(bike_share$member_casual)
## [1] "casual" "member"

Changing the datatype and defining new Columns

#explicitly changing the data type
bike_share$start_date <- as.Date(bike_share$started_at, format = "%m/%d/%Y")
bike_share$start_time <- as.POSIXct(bike_share$started_at,format="%H:%M:%S")
bike_share$end_date <- as.Date(bike_share$ended_at, format = "%m/%d/%Y")
bike_share$end_time <-  as.POSIXct(bike_share$ended_at,format="%H:%M:%S")




#defining new Columns for weekday, month, total time, and  distance rode
bike_share$day <- weekdays(as.Date(bike_share$start_date ))
bike_share$month <- months(as.Date(bike_share$start_date ))
bike_share$ride_length <-difftime(bike_share$end_time,bike_share$start_time,units = "secs")
bike_share$ride_length <- as.numeric(as.character(bike_share$ride_length))
bike_share$ride_distance <- distGeo(matrix(c(bike_share$start_lng, bike_share$start_lat), ncol=2), matrix (c(bike_share$end_lng, bike_share$end_lat), ncol=2))
bike_share$ride_distance <- bike_share$ride_distance/1000

Phase 4:- Analyze

The next step is to aggregate the data into a useful and accessible state, organize and format the data as per requirements, perform some calculations, and identify trends and relationships.

Removing rows where ride length is less than or equal to 0 since the data is inaccurate.

#removing rows with total time rode less than or equal to 0 second
bike_share <- bike_share %>%
  arrange(ride_length) 
bike_share_duplicate  <- bike_share[!(bike_share$ride_length <=0),]
bike_share_duplicate  <- bike_share_duplicate[!(bike_share_duplicate$start_lng == -73.79647698 | bike_share_duplicate$start_lat == 45.635034323),]

Removing rows with null values in the latitude and longitude of the end station.

#removing rows with null values in end_lat and end_lng  
bike_share_duplicate<- bike_share_duplicate[-which(is.na(bike_share_duplicate$end_lat| bike_share_duplicate$end_lng)), ]

Creating a box plot to identify the outlier.

boxplot(bike_share_duplicate$ride_length,
        ylab = "bike_share_duplicate$ride_length")

Distribution of rides over time into several sectors to understand the distribution of the ride length in second.

#plotting a pie chart to check the distribution of rides over time
bike_share_pie1 <- bike_share_duplicate %>%
  mutate(ranges = cut(ride_length, c(0, 60, 300, 600, 3600, 7200, 14400, Inf))) %>% 
  group_by(ranges) %>% 
  summarize(number=n())

pie_lables1<- paste0(round(100 * bike_share_pie1$number / sum(bike_share_pie1$number), 1),"%")
pie(bike_share_pie1$number, labels =pie_lables1 , main = "Distribution of ride over time",
    col = rainbow(length(bike_share_pie1$number)))
legend("topright",c( "(0,60]", "(60,300]", "(300,600]", "(600,3.6e+03]", "(3.6e+03,7.2e+03]", "(7.2e+03,1.44e+04]", "(1.44e+04,Inf]"), cex = 0.8, fill = rainbow(length(bike_share_pie1$number)))

Removing rows where ride length is less than one minute since the data is irrelevant.

#filtering out the data  from 61 seconds on words in respect to ride_length  
bike_share_duplicate <- subset(bike_share_duplicate, ride_length >60)
bike_share_duplicate <- bike_share_duplicate %>%
  arrange(-ride_length) 

#plotting a pie chart to check the distribution of rides between 61 and 3600
bike_share_pie2 <- bike_share_duplicate %>%
  mutate(ranges = cut(ride_length, c(60, 300, 600, 3600, 7200, 14400, Inf))) %>% 
  group_by(ranges) %>% 
  summarize(number=n())


pie_lables2<- paste0(round(100 * bike_share_pie2$number / sum(bike_share_pie2$number), 1),"%")
pie(bike_share_pie2$number, labels = pie_lables2, main = "Distribution of ride over time after filter",
    col = rainbow(length(bike_share_pie2$number)))
legend("topright",c(  "(60,300]", "(300,600]", "(600,3.6e+03]", "(3.6e+03,7.2e+03]", "(7.2e+03,1.44e+04]", "(1.44e+04,Inf]"), cex = 0.8, fill = rainbow(length(bike_share_pie2$number)))

The below box plot shows outliers. But as the data is relevant, the data cannot be dropped.

boxplot(bike_share_duplicate$ride_length,
        ylab = "bike_share1$ride_length")

Description of the new data set.

skim_without_charts(bike_share_duplicate)
Data summary
Name bike_share_duplicate
Number of rows 5794612
Number of columns 21
_______________________
Column type frequency:
character 9
Date 2
numeric 6
POSIXct 4
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
ride_id 0 1.00 16 16 0 5794612 0
rideable_type 0 1.00 11 13 0 3 0
start_station_name 810999 0.86 3 64 0 1291 0
start_station_id 810996 0.86 3 37 0 1155 0
end_station_name 855317 0.85 9 64 0 1309 0
end_station_id 855317 0.85 3 37 0 1167 0
member_casual 0 1.00 6 6 0 2 0
day 0 1.00 6 9 0 7 0
month 0 1.00 3 9 0 12 0

Variable type: Date

skim_variable n_missing complete_rate min max median n_unique
start_date 0 1 2021-07-01 2022-06-30 2021-10-27 365
end_date 0 1 2021-07-01 2022-07-01 2021-10-27 366

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100
start_lat 0 1 41.90 0.05 41.64 41.88 41.90 41.93 42.07
start_lng 0 1 -87.65 0.03 -87.84 -87.66 -87.64 -87.63 -87.52
end_lat 0 1 41.90 0.05 41.39 41.88 41.90 41.93 42.17
end_lng 0 1 -87.65 0.03 -88.97 -87.66 -87.64 -87.63 -87.49
ride_length 0 1 1129.70 6444.15 61.00 389.00 682.00 1225.00 2946429.00
ride_distance 0 1 2.20 1.96 0.00 0.93 1.64 2.87 114.57

Variable type: POSIXct

skim_variable n_missing complete_rate min max median n_unique
started_at 0 1 2021-07-01 00:00:22 2022-06-30 23:59:58 2021-10-27 13:15:55 4850042
ended_at 0 1 2021-07-01 00:04:51 2022-07-01 18:52:41 2021-10-27 13:30:28 4850096
start_time 0 1 2021-07-01 00:00:22 2022-06-30 23:59:58 2021-10-27 13:15:55 4850042
end_time 0 1 2021-07-01 00:04:51 2022-07-01 18:52:41 2021-10-27 13:30:28 4850096

Creating functions to pot graphs.

#creating a function to create pie chart
pie_chart <- function(data, grouping_var, title) {
   tab <- table(data[[deparse(substitute(grouping_var))]])
   pie_lables<- paste0(round(100 * tab / sum(tab), 1),"%")
    pie(tab, labels = pie_lables, main = title,
      col = rainbow(length(tab)))
  legend("topright", names(tab), cex = 0.8, fill = rainbow(length(tab)))
}



#creating a function to create bar graph
bar_graph <- function(data, grouping_var, title,xtitle, ytitle){
  df <- table(data[[deparse(substitute(grouping_var))]])
  barplot(df,
          main=title,
          xlab=xtitle,
          ylab=ytitle,
          border="black",
          col=rainbow(7) 
  )
}


#creating a function to create a group graph
group_bar <- function(data, gp1, gp2,  mtitle, xtitle, ytitlt, ltitle, n=1){
  df1 <- data_frame(data %>% 
    group_by({{gp1}}, {{gp2}}) %>%
    summarise(number_of_rides = n()))

  if(n==0){
    df1[[1]] <- ordered(df1[[1]],
                        levels=c("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday"))
  }
  
  ggplot(df1,
         aes(x = df1[[1]], 
             y = df1[[3]],
             fill = df1[[2]]))+   labs(title=mtitle, x =xtitle, y =ytitlt , fill = ltitle)+
    geom_bar(stat = "identity",
             position = "dodge") +geom_text(aes(label = df1[[3]]),position = position_dodge(width = 1),
                                            vjust = -0.5, size = 2) +newtheme 


}




#creating a function to create line graph
line_plot <- function(data, gp1, gp2, mtitle, xtitle, ytitlt, ltitle){
  df1 <- data %>% 
    group_by({{gp1}},{{gp2}} ) %>%
    summarise(number_of_rides = n())
 
    df1[[1]] <- factor(df1[[1]], levels= c("July", "August", 
                                 "September", "October", "November", "December", "January","February","March", "April","May","June"))
    
    data[order(df1[[1]]), ]
 
  
 ggplot(data=df1, aes(x= df1[[1]], y = df1[[3]], group =df1[[2]], color= df1[[2]] )) +
    geom_line()+
    geom_point()+
    labs(title=mtitle,
         x =xtitle, y =ytitlt , color = ltitle)+geom_text(aes(label = df1[[3]]),position = position_dodge(width = 1),
                                                          vjust = -0.5, size = 2) +newtheme 
  
}





#creating a function to create a line graph for average distance rode
average_ride <- function(data, gp1, gp2, sumz1, mtitle, xtitle, ytitlt, ltitle, n=0){
  
  df1<- data %>%
    group_by({{gp1}}, {{gp2}})%>%
    summarise(mean({{sumz1}}))
  
  if (n==1) {
    df1[[2]] <- factor(df1[[2]], levels= c("July", "August", 
                                           "September", "October", "November", "December", "January","February","March", "April","May","June"))
    
    data[order(df1[[2]]), ]
  } else {
    df1[[2]]<- factor(df1[[2]], levels= c("Sunday", "Monday", 
                                          "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday"))
    data[order(df1[[2]]), ]
    
  }
  
ggplot(data=df1, aes(x= df1[[2]], y = df1[[3]], group =df1[[1]], color= df1[[1]] )) +
    geom_line()+
    geom_point()+
    labs(title=mtitle,
         x =xtitle, y =ytitlt , color = ltitle) +geom_text(aes(label = as.integer(df1[[3]])),position = position_dodge(width = 1),
                                                           vjust = -0.5, size = 2) +newtheme 
}


average_ride1 <- function(data, gp1, gp2,gp3, sumz1, mtitle, xtitle, ytitlt, ltitle){
  
  df1<- data %>%
    group_by({{gp1}}, {{gp2}},{{gp3}})%>%
    summarise(mean({{sumz1}}))
  
  df1[[3]] <- factor(df1[[3]], levels= c("July", "August", 
                                         "September", "October", "November", "December", "January","February","March", "April","May","June"))
  data[order(df1[[3]]), ]
  
ggplot(data=df1, aes(x= df1[[3]], y = df1[[4]], group =df1[[2]], color= df1[[2]] )) +
    geom_line()+
    geom_point()+
    facet_wrap(vars(df1[[1]]))+
    labs(title=mtitle,
         x =xtitle, y =ytitlt , color = ltitle)+geom_text(aes(label = as.integer(df1[[4]])),position = position_dodge(width = 1),
                                                          vjust = -0.5, size = 2) +newtheme 


}

total_ride <- function(data, gp1, gp2,gp3, mtitle, xtitle, ytitlt, ltitle){
  
  df1<- data %>%
    group_by({{gp1}}, {{gp2}},{{gp3}})%>%
    summarise(number_of_rides = n())
  
  df1[[3]] <- factor(df1[[3]], levels= c("July", "August", 
                                         "September", "October", "November", "December", "January","February","March", "April","May","June"))
  data[order(df1[[3]]), ]
  
  ggplot(data=df1, aes(x= df1[[3]], y = df1[[4]], group =df1[[2]], color= df1[[2]] )) +
    geom_line()+
    geom_point()+
    facet_wrap(vars(df1[[1]]))+
    labs(title=mtitle,
         x =xtitle, y =ytitlt , color = ltitle)+geom_text(aes(label = as.integer(df1[[4]])),position = position_dodge(width = 1),
                                                          vjust = -0.5, size = 2) +newtheme 
  
  
}

Handling missing values of the names of starting stations

#filling few missing start station names
bike_share_duplicate <- bike_share_duplicate %>%
  group_by(start_lat, start_lng) %>%
  arrange(start_station_name) %>%
  fill(start_station_name) 

Creating a few tables to calculate the total number of rides from different customer types and bike types for each Start station.

#creating a table to calculate the total number of rides for each station as per membership type and bike type
bike_share_station <- bike_share_duplicate %>%
  group_by(member_casual,start_station_name)%>%
  summarise(number_of_rider= n())
## `summarise()` has grouped output by 'member_casual'. You can override using the
## `.groups` argument.
bike_share_station1 <- bike_share_duplicate %>%
  group_by(start_station_name)%>%
  summarise(number_of_rider= n())

bike_share_station_bike <- bike_share_duplicate %>%
  group_by(start_station_name, rideable_type)%>%
  summarise(number_of_rider= n())
## `summarise()` has grouped output by 'start_station_name'. You can override
## using the `.groups` argument.
#creating a table with the start station name and its location coordinates 
bike_share_station_location<- bike_share_duplicate %>%
  select(start_station_name, start_lat,start_lng)


#deleting duplicate values on the bases of start station names
bike_share_station_location <- bike_share_station_location[!duplicated(bike_share_station_location$start_station_name),]


#merging two tables
bike_share_station <- merge(x=bike_share_station,y=bike_share_station_location,by="start_station_name",all=TRUE)
bike_share_station <-bike_share_station[!is.na(bike_share_station$start_station_name),]


bike_share_station1<- merge(x=bike_share_station1,y=bike_share_station_location,by="start_station_name",all=TRUE)
bike_share_station1 <-bike_share_station1[!is.na(bike_share_station1$start_station_name),]


bike_share_station_bike<- merge(x=bike_share_station_bike,y=bike_share_station_location,by="start_station_name",all=TRUE)
bike_share_station_bike<- bike_share_station_bike[!is.na(bike_share_station_bike$start_station_name),]

bike_share_station <- bike_share_station%>%
  arrange(-number_of_rider,member_casual,start_station_name)
bike_share_station1 <- bike_share_station1%>%
  arrange(-number_of_rider)
bike_share_station_bike <- bike_share_station_bike%>%
  arrange(-number_of_rider)

#splitting the table into two
bike_share_station_member<- bike_share_station[bike_share_station$member_casual == "member", ]
bike_share_station_causal<- bike_share_station[bike_share_station$member_casual == "casual", ]
bike_share_station_docked_bike<- bike_share_station_bike[bike_share_station_bike$rideable_type == "docked_bike", ]
bike_share_station_classic_bike<- bike_share_station_bike[bike_share_station_bike$rideable_type == "classic_bike", ]
bike_share_station_electric_bike<- bike_share_station_bike[bike_share_station_bike$rideable_type == "electric_bike", ]

Creating a custom Theme for the Graphs.

#setting up theme for graphs
newtheme <- theme_light() +
  theme(plot.title = element_text(color = "#002949", face = 'bold', size =10),

        panel.border = element_rect(color = "#002949", size = 1),
        legend.position = "right",
        legend.text = element_text(colour="red", size=8, face="bold"),
        legend.title = element_text(colour="red", size=8, face="bold"),
        axis.title.x = element_text(colour = "#002949"),
        axis.title.y = element_text(colour = "#002949"),
        axis.text.x = element_text(angle = 45, hjust = 1, color = '#002949'),
        axis.text.y = element_text(angle = 45, hjust = 1, color = '#002949'),
        axis.line = element_line(color = "#002949", size =1),
  )

Phase 5:- Share

#piloting pie chart using a user-defined function
pie_chart(bike_share_duplicate, rideable_type, "Usage of different types of bike")

bar_graph(bike_share_duplicate, rideable_type, "Different types of bike", "Bike Type", "count")

The above pie chart and bar graph shows classic bikes are the most used bikes by the customer, followed by electric bikes.

pie_chart(bike_share_duplicate, member_casual, "Membership vs Casual")

bar_graph(bike_share_duplicate, member_casual, "Membership vs Casual", "Membership Type", "count")

The above pie chart and bar graph shows more than half of the customers are already annual members.

pie_chart(bike_share_duplicate, day, "Usage of different types of the bike during weekdays")

The above pie chart shows the weekday rides distribution for one year.

#piloting group graph using a user-defined function
group_bar(bike_share_duplicate,rideable_type, member_casual,"Total number of rides  for each bike", "Bike Type", "Number of rides", "Membership Type")
## Warning: `data_frame()` was deprecated in tibble 1.1.0.
## Please use `tibble()` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was generated.
## `summarise()` has grouped output by 'rideable_type'. You can override using the
## `.groups` argument.
## Warning: Use of `df1[[1]]` is discouraged. Use `.data[[1]]` instead.
## Warning: Use of `df1[[3]]` is discouraged. Use `.data[[3]]` instead.
## Warning: Use of `df1[[2]]` is discouraged. Use `.data[[2]]` instead.
## Warning: Use of `df1[[3]]` is discouraged. Use `.data[[3]]` instead.
## Warning: Use of `df1[[1]]` is discouraged. Use `.data[[1]]` instead.
## Warning: Use of `df1[[3]]` is discouraged. Use `.data[[3]]` instead.
## Warning: Use of `df1[[2]]` is discouraged. Use `.data[[2]]` instead.

The above bar graph shows the total number of rides for different bikes with respect to customer type. From the graph, it is clear that classic bikes are most used, followed by electric bikes for both Annual and Casual customers.

group_bar(bike_share_duplicate,day,member_casual, "Total number of rides for each day", "Weekday", "Number of rides", "Membership Type",0)
## `summarise()` has grouped output by 'day'. You can override using the `.groups`
## argument.
## Warning: Use of `df1[[1]]` is discouraged. Use `.data[[1]]` instead.
## Warning: Use of `df1[[3]]` is discouraged. Use `.data[[3]]` instead.
## Warning: Use of `df1[[2]]` is discouraged. Use `.data[[2]]` instead.
## Warning: Use of `df1[[3]]` is discouraged. Use `.data[[3]]` instead.
## Warning: Use of `df1[[1]]` is discouraged. Use `.data[[1]]` instead.
## Warning: Use of `df1[[3]]` is discouraged. Use `.data[[3]]` instead.
## Warning: Use of `df1[[2]]` is discouraged. Use `.data[[2]]` instead.

The above bar graph shows the total number of rides throughout the weekday with respect to customer type. The annual members’ and casual customers’ behavior are different from each other. Casual members’ usage of the bike is more on weekends as compared to weekdays, whereas, for annual members, their usage is more on weekdays as compared to weekends.

#piloting line graph using a user-defined function
line_plot(bike_share_duplicate, month,member_casual,"Total number of rides", "Month", "Number of rides", "Membership Type")
## `summarise()` has grouped output by 'month'. You can override using the
## `.groups` argument.

The above line graph shows casual and annual members’ total number of rides during 12 months. During the winter season, the number of rides decreases. Other than the month of July and August, annual members have more rides.

line_plot(bike_share_duplicate, month, rideable_type,"Total number of rides", "Month", "Number of rides", "Bike Type")
## `summarise()` has grouped output by 'month'. You can override using the
## `.groups` argument.

The above line graph shows casual and annual members’ total number of rides during 12 months.Docked bikes have the least number of rides.

average_ride(bike_share_duplicate,member_casual, day, ride_length, "Average time driven during weekdays by different member type", "Weekday", "Average Time", "Membership Type",0)
## `summarise()` has grouped output by 'member_casual'. You can override using the
## `.groups` argument.

The above line graph shows the average time driven during weekdays by casual and annual members. During the weekdays, casual members have a high average time in comparison to annual members.

average_ride(bike_share_duplicate,rideable_type, day, ride_length, "Average time driven during weekdays by different bike type", "Weekday", "Average Time", "Bike Type",0)
## `summarise()` has grouped output by 'rideable_type'. You can override using the
## `.groups` argument.

The above line graph shows the average time driven during weekdays by three types of bikes provided by the company.

average_ride(bike_share_duplicate,member_casual, month, ride_distance, "Average distance driven during weekdays by different member type", "Month", "Average Time", "Membership Type",1)
## `summarise()` has grouped output by 'member_casual'. You can override using the
## `.groups` argument.

The above line graph shows casual and annual members’ average distance driven during 12 months.

average_ride(bike_share_duplicate,rideable_type, month, ride_distance, "Average distance driven during a year by different bike type", "Month", "Average Time", "Bike Type",1)
## `summarise()` has grouped output by 'rideable_type'. You can override using the
## `.groups` argument.

The above line graph shows the average distance driven during 12 months by three types of bikes provided by the company.

average_ride1(bike_share_duplicate,member_casual,rideable_type,month, ride_length, "Average bike rental duration for members and casual users ", "Month", "Average Time", "Bike Type")
## `summarise()` has grouped output by 'member_casual', 'rideable_type'. You can
## override using the `.groups` argument.

total_ride(bike_share_duplicate, member_casual,rideable_type, month, "Total bike rental count for members and casual users ", "Month", "Average time", "Bike Type")
## `summarise()` has grouped output by 'member_casual', 'rideable_type'. You can
## override using the `.groups` argument.

The above line graph shows the average distance driven and total number of rides during 12 months by three types of bikes provided by the company by both annual members and casual members. Casual members have more average time for the rental of all three bikes. The most used bike is Classic bikes.

#Plotting map
bike_share_station1 %>%
  mapview(
  xcol = "start_lng", 
  ycol = "start_lat",
  cex = "number_of_rider",
  alpha = 0.9, 
  crs = 4269,
  color = "blue",
  grid = F, 
  legend = T,
  layer.name = " Start Station Locations"
)+

bike_share_station1[1:10,] %>%
  mapview(
    xcol = "start_lng", 
    ycol = "start_lat",
    zcol = "start_station_name",
    cex = "number_of_rider",
    alpha = 0.9, 
    crs = 4269,
    color = "red",
    grid = F, 
    legend = T,
    layer.name = " 10 Most popular Start Stations"
    
  )

The above map shows the location of the Start stations. The size of the circle represents the popularity of the stations. The top most popular stations are highlighted in the Second layer of the map.

bike_share_station_member %>%
  mapview(
    xcol = "start_lng", 
    ycol = "start_lat",
    cex = "number_of_rider",
    alpha = 0.9, 
    crs = 4269,
    color = "blue",
    grid = F, 
    legend = T,
    layer.name = " Annual Member Start Stations" 
  )+

bike_share_station_member[1:10,] %>%
  mapview(
    xcol = "start_lng", 
    ycol = "start_lat",
    zcol = "start_station_name",
    cex = "number_of_rider",
    alpha = 0.9, 
    crs = 4269,
    color = "red",
    grid = F, 
    legend = T,
    layer.name = " 10 most popular annual member Start Stations" 
  )+


bike_share_station_causal %>%
  mapview(
    xcol = "start_lng", 
    ycol = "start_lat",
    cex = "number_of_rider",
    alpha = 0.9, 
    crs = 4269,
    color = "yellow",
    grid = F, 
    legend = T,
    layer.name = " Causal Member Start Stations"
  )+

bike_share_station_causal[1:10,] %>%
  mapview(
    xcol = "start_lng", 
    ycol = "start_lat",
    zcol = "start_station_name",
    cex = "number_of_rider",
    alpha = 0.9, 
    crs = 4269,
    color = "green",
    grid = F, 
    legend = T,
    layer.name = "  10 most popular Causal Member Start Stations"
    
  )

The above map shows the location of popular stations with respect to the member type. The most popular stations for both riders are different.

bike_share_station_classic_bike[1:10,]  %>%
  mapview(
    xcol = "start_lng", 
    ycol = "start_lat",
    zcol = "start_station_name",
    cex = "number_of_rider",
    alpha = 0.9, 
    crs = 4269,
    color = "blue",
    grid = F, 
    legend = T,
    layer.name = "10 most popular Start Stations for clasisic bike"
  )+
   bike_share_station_electric_bike[1:10,]  %>%
   mapview(
     xcol = "start_lng", 
     ycol = "start_lat",
     zcol = "start_station_name",
     cex = "number_of_rider",
     alpha = 0.9, 
     crs = 4269,
     color = "red",
     grid = F, 
     legend = T,
     layer.name = "10 most popular Start Stations for electric bike"
   )+
   bike_share_station_docked_bike[1:10,]  %>%
   mapview(
     xcol = "start_lng", 
     ycol = "start_lat",
     zcol = "start_station_name",
     cex = "number_of_rider",
     alpha = 0.9, 
     crs = 4269,
     color = "green",
     grid = F, 
     legend = T,
     layer.name = "10 most popular Start Stations for docked bike"
   )

The above map shows the location of popular stations with respect to the bike type.

Phase 6:- Act

Conclusion

  1. Increase the price of classic bike and electric bike for casual rides as most of the casual riders prefer this type of bike and give temporary offers on the docked bikes for an annual membership. This strategy could help the company to motivate casual members to buy annual memberships.

  2. Even if annual members have more rides, the average time and distance are more for casual users. So increasing the cost of the hourly and daily passes could force the casual members to buy an annual pass.

  3. Casual riders have a higher count of renting during the weekend. Increasing the rent on weekends could force them to shift to an annual membership.

  4. Introducing a special weekend membership, which can later be converted into an annual membership, will be beneficial, as casual riders are more frequent during weekends.

  5. Stations like Clark st & lincoln ave and Dusable lake shore dr & Monroe st are the busiest start stations for casual riders. A campaign to promote annual membership in these places would help in faster results.